ABSTRACT

This paper compares a qualitative reasoning model of translation with a 
quantitative statistical model. We consider these models within the context 
of two hypothetical speech translation systems, starting with a logic-based 
design and pointing out which of its characteristics are best preserved or 
eliminated in moving to the second, quantitative design. The quantitative 
language and translation models are based on relations between lexical heads 
of phrases. Statistical parameters for structural dependency, lexical transfer, 
and linear order are used to select a set of implicit relations between words in 
a source utterance, a corresponding set of relations between target language 
words, and the most likely translation of the original utterance. 

1 Introduction 

In recent years there has been a resurgence of interest in statistical ap- 
proaches to natural language processing. Such approaches are not new, wit- 
ness the statistical approach to machine translation suggested by Weaver 
(1955), but the current level of interest is largely due to the success of ap- 
plying hidden Markov models and N-gram language models in speech recog- 
nition. This success was directly measurable in terms of word recognition 
error rates, prompting language processing researchers to seek corresponding 
improvements in performance and robustness. A speech translation system, 
which by necessity combines speech and language technology, is a natural 
place to consider combining the statistical and conventional approaches and 
much of this paper describes probabilistic models of structural language 
analysis and translation. Our aim will be to provide an overall model for 
translation with the best of both worlds. Various factors will lead us to 
conclude that a lexicalist statistical model with dependency relations is well
suited to this goal. 

As well as this quantitative approach, we will consider a constraint/logic
based approach and try to distinguish characteristics that we wish to pre- 
serve from those that are best replaced by statistical models. Although 
perhaps implicit in many conventional approaches to translation, a charac- 
terization in logical terms of what is being done is rarely given, so we will 
attempt to make that explicit here, more or less from first principles. 

Before proceeding, I will first examine some fashionable distinctions in 
section 2 in order to clarify the issues involved in comparing these
approaches. I will attempt to argue that the important distinction is not
so much a rational-empirical or symbolic-statistical distinction but rather a
qualitative-quantitative one. This is followed by discussion of the logic-based
model in section 3, the overall quantitative model in section 4, monolingual
models in section 5, translation models in section 6, and some conclusions in
section 7. We concentrate throughout on what information about language
and translation is coded and how it is expressed as logical constraints or 
statistical parameters. Although important, we will say little about search 
algorithms, rule acquisition, or parameter estimation. 

2 Qualitative and Quantitative Models 

One contrast often taken for granted is the identification of a 'statistical- 
symbolic' distinction in language processing as an instance of the empirical 
vs. rational debate. I believe this contrast has been exaggerated, though
historically it has had some validity in terms of accepted practice. Rule-based
approaches have become more empirical in a number of ways: First, a more 
empirical approach is being adopted to grammar development whereby the 
rule set is modified according to its performance against corpora of natural 
text (e.g. Taylor, Grover, and Briscoe 1989). Second, there is a class of 
techniques for learning rules from text, a recent example being Brill 1993. 
Conversely, it is possible to imagine building a language model in which all 
probabilities are estimated according to intuition without reference to any 
real data, giving a probabilistic model that is not empirical. 

Most language processing labeled as statistical involves associating real- 
number valued parameters to configurations of symbols. This is not sur- 
prising given that natural language, at least in written form, is explicitly 
symbolic. Presumably, classifying a system as symbolic must refer to a 
different set of (internal) symbols, but even this does not rule out many
statistical systems modeling events involving nonterminal categories and word
senses. Given that the notion of a symbol, let alone an 'internal symbol', is 
itself a slippery one, it may be unwise to build our theories of language, or 
even the way we classify different theories, on this notion. 

Instead, it would seem that the real contrast driving the shift towards 
statistics in language processing is a contrast between qualitative systems 
dealing exclusively with combinatoric constraints, and quantitative systems 
that involve computing numerical functions. This bears directly on the 
problems of brittleness and complexity that discrete approaches to language 
processing share with, for example, reasoning systems based on traditional 
logical inference. It relates to the inadequacy of the dominant theories in 
linguistics to capture 'shades' of meaning or degrees of acceptability which 
are often recognized by people outside the field as important inherent prop- 
erties of natural language. The qualitative-quantitative distinction can also 
be seen as underlying the difference between classification systems based on 
feature specifications, as used in unification formalisms (Shieber 1986), and 
clustering based on a variable degree of granularity (e.g. Pereira, Tishby 
and Lee 1993). 

It seems unlikely that these continuously variable aspects of fluent nat- 
ural language can be captured by a purely combinatoric model. This natu- 
rally leads to the question of how best to introduce quantitative modeling 
into language processing. It is not, of course, necessary for the quantities 
of a quantitative model to be probabilities. For example, we may wish to 
define real-valued functions on parse trees that reflect the extent to which 
the trees conform to, say, minimal attachment and parallelism between con- 
juncts. Such functions have been used in tandem with statistical functions 
in experiments on disambiguation (for instance Alshawi and Carter 1994). 
Another example is connection strengths in neural network approaches to 
language processing, though it has been shown that certain networks are 
effectively computing probabilities (Richard and Lippmann 1991). 

Nevertheless, probability theory does offer a coherent and relatively well 
understood framework for selecting between uncertain alternatives, making 
it a natural choice for quantitative language processing. The case for prob- 
ability theory is strengthened by a well developed empirical methodology 
in the form of statistical parameter estimation. There is also the strong 
connection between probability theory and the formal theory of information 
and communication, a connection that has been exploited in speech recog- 
nition, for example using the concept of entropy to provide a motivated way 
of measuring the complexity of a recognition problem (Jelinek et al. 1992).

Even if probability theory remains, as it currently is, the method of 
choice in making language processing quantitative, this still leaves the field 
wide open in terms of carving up language processing into an appropriate 
set of events for probability theory to work with. For translation, a very 
direct approach using parameters based on surface positions of words in 
source and target sentences was adopted in the Candide system (Brown et 
al. 1990). However, this does not capture important structural properties of 
natural language. Nor does it take into account generalizations about trans- 
lation that are independent of the exact word order in source and target 
sentences. Such generalizations are, of course, central to qualitative struc- 
tural approaches to translation (e.g. Isabelle and Macklovitch 1986, Alshawi 
et al. 1992). 

The aim of the quantitative language and translation models presented 
in sections 5 and 6 is to employ probabilistic parameters that reflect
linguistic structure without discarding rich lexical information or making the
models too complex to train automatically. In terms of a traditional classifi- 
cation, this would be seen as a 'hybrid symbolic-statistical' system because 
it deals with linguistic structure. From our perspective, it can be seen as a 
quantitative version of the logic-based model because both models attempt 
to capture similar information (about the organization of words into phrases 
and relations holding between these phrases or their referents), though the 
tools of modeling are substantially different. 

3 Dissecting a Logic-Based System 

We now consider a hypothetical speech translation system in which the 
language processing components follow a conventional qualitative transfer 
design. Although hypothetical, this design and its components are similar 
to those used in existing database query (Rayner and Alshawi 1992) and 
translation systems (Alshawi et al. 1992). More recent versions of these
systems have been gradually taking on a more quantitative flavor, particularly
with respect to choosing between alternative analyses, but our hypothetical 
system will be more purist in its qualitative approach. 

The overall design is as follows. We assume that a speech recognition 
subsystem delivers a list of text strings corresponding to transcriptions of an 
input utterance. These recognition hypotheses are passed to a parser which 
applies a logic-based grammar and lexicon to produce a set of logical forms, 
specifically formulas in first order logic corresponding to possible
interpretations of the utterance. The logical forms are filtered by contextual and
word-sense constraints, and one of them is passed to the translation com- 
ponent. The translation relation is expressed by a set of first order axioms 
which are used by a theorem prover to derive a target language logical form 
that is equivalent (in some context) to the source logical form. A gram- 
mar for the target language is then applied to the target form, generating a 
syntax tree whose fringe is passed to a speech synthesizer. 

Taking the various components in turn, we make a note of undesirable 
properties that might be improved by quantitative modeling. 

Analysis and Generation 

A grammar, expressed as a set of syntactic rules (axioms) $G_{syn}$ and a set
of semantic rules (axioms) $G_{sem}$, is used to support a relation $form$ holding
between strings $s$ and logical forms $\phi$ expressed in first order logic:

$$G_{syn} \cup G_{sem} \models form(s, \phi).$$

The relation $form$ is many-to-many, associating a string with linguistically
possible logical form interpretations. In the analysis direction, we are given
$s$ and search for logical forms $\phi$, while in generation we search for strings $s$
given $\phi$.

For analysis and generation, we are treating strings $s$ and logical forms
$\phi$ as object level entities. In interpretation and translation, we will move
down from this meta-level reasoning to reasoning with the logical forms as 
propositions. 

The list of text strings handed by the recognizer to the parser can be 
assumed to be ordered in accordance with some acoustic scoring scheme 
internal to the recognizer. The magnitude of the scores is ignored by our 
qualitative language processor; it simply processes the hypotheses one at 
a time until it finds one for which it can produce a complete logical form 
interpretation that passes grammatical and interpretation constraints, at 
which point it discards the remaining hypotheses. Clearly, discarding the 
acoustic score and taking the first hypothesis that satisfies the constraints 
may lead to an interpretation that is less plausible than one derivable from 
a hypothesis further down in the recognition list. But there is no point 
in processing these later hypotheses since we will be forced to select one 
interpretation essentially at random. 
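
The first-acceptable-hypothesis behavior described above can be made concrete with a minimal sketch. The function names and the toy lookup table standing in for the parser and interpretation constraints are assumptions introduced purely for illustration; they are not part of any system described here.

```python
from typing import Callable, Iterable, Optional


def first_acceptable(hypotheses: Iterable[str],
                     interpret: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return the logical form of the first hypothesis that can be interpreted.

    `hypotheses` is assumed to be ordered best-first by the recognizer;
    the acoustic scores themselves are never consulted.
    """
    for text in hypotheses:
        logical_form = interpret(text)  # None means the constraints rejected it
        if logical_form is not None:
            return logical_form         # all remaining hypotheses are discarded
    return None


def toy_interpret(text: str) -> Optional[str]:
    """Hypothetical stand-in for parsing plus interpretation constraints."""
    known = {
        "show flights to boston": "show(flights(to(boston)))",
        "show flights to austin": "show(flights(to(austin)))",
    }
    return known.get(text)


n_best = ["show flights two boston",   # rejected: no logical form
          "show flights to austin",    # accepted first, so it wins
          "show flights to boston"]    # never examined
result = first_acceptable(n_best, toy_interpret)
```

In this toy run the second hypothesis is selected simply because it is the first to pass the constraints; the third is never considered, whatever its plausibility.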

Syntax The syntactic rules in $G_{syn}$ relate 'category' predicates $c_0, c_1, c_2$
holding of a string and two spanning substrings (we limit the rules here to
two daughters for simplicity):

$$c_0(s_0) \wedge daughters(s_0, s_1, s_2) \leftarrow c_1(s_1) \wedge c_2(s_2) \wedge (s_0 = concat(s_1, s_2))$$

(Here, and subsequently, variables like $s_0$ and $s_1$ are implicitly universally
quantified.) $G_{syn}$ also includes lexical axioms for particular strings $w$
consisting of single words:

$$c_1(w), \ldots, c_m(w).$$

For a feature-based grammar, these rules can include conjuncts constraining
the values, $a_1, a_2, \ldots$, of discrete-valued functions $f$ on the strings:

$$f(w) = a_i, \qquad f(s_0) = f(s_1).$$

The main problem here is that such grammars have no notion of a degree
of grammatical acceptability - a sentence is either grammatical or
ungrammatical. For small grammars this means that perfectly acceptable strings
are often rejected; for large grammars we get a vast number of alternative
trees so the chance of selecting the correct tree for simple sentences can
get worse as the grammar coverage increases. There is also the problem of
requiring increasingly complex feature sets to describe idiosyncrasies in the
lexicon.

Semantics Semantic grammar axioms belonging to $G_{sem}$ specify a
'composition' function $g$ for deriving a logical form for a phrase from those for
its subphrases:

$$form(s_0, g(\phi_1, \phi_2)) \leftarrow daughters(s_0, s_1, s_2) \wedge c_1(s_1) \wedge c_2(s_2) \wedge c_0(s_0) \wedge form(s_1, \phi_1) \wedge form(s_2, \phi_2)$$

The interpretation rules for strings bottom out in a set of lexical semantic
rules associating words with predicates ($p_1, p_2, \ldots$) corresponding to 'word
senses'. For a particular word and syntactic category, there will be a (small,
possibly empty) finite set of such word sense predicates:

$$c_i(w) \rightarrow form(w, p^i_1)$$
$$\ldots$$
$$c_i(w) \rightarrow form(w, p^i_m).$$

First order logic was assumed as the semantic representation language 
because it comes with well understood, if not very practical, inferential 
machinery for constraint solving. However, applying this machinery requires 
making logical forms fine grained to a degree often not warranted by the 
information the speaker of an utterance intended to convey. An example of 
this is explicit scoping which leads (again) to large numbers of alternatives 
which the qualitative model has difficulty choosing between. Also, many 
natural language sentences cannot be expressed in first order logic without 
resort to elaborate formulas requiring complex semantic composition rules. 
These rules can be simplified by using a higher order logic but at the expense 
of even less practical inferential machinery. 

In applying the grammar in generation we are faced with the problem of 
balancing over- and under-generation by tweaking grammatical constraints,
there being no way to prefer fully grammatical target sentences over more 
marginal ones. Qualitative approaches to grammar tend to emphasize the 
ability to capture generalizations as the main measure of success in linguistic 
modeling. This might explain why producing appropriate lexical collocations 
is rarely addressed seriously in these models, even though lexical collocations 
are important for fluent generation. The study of collocations for generation 
fits in more naturally with statistical techniques, as illustrated by Smadja
and McKeown (1990). 

Interpretation 

In the logic-based model, interpretation is the process of identifying from
the possible interpretations $\phi$ of $s$ for which $form(s, \phi)$ holds, ones that are
consistent with the context of interpretation. We can state this as follows:

$$R \cup S \cup A \models \phi.$$

Here, we have separated the context into a contingent set of contextual 
propositions S and a set R of (monolingual) 'meaning postulates', or selec- 
tional restrictions, that constrain the word sense predicates in all contexts. 
$A$ is a set of assumptions sufficient to support the interpretation $\phi$ given $S$
and $R$. In other words, this is 'interpretation as abduction' (Hobbs et al.
1988), since abduction, not deduction, is needed to arrive at the assumptions
$A$.

The most common types of meaning postulates in $R$ are those for restriction,
hyponymy, and disjointness, expressed as follows:

$p_1(x_1, x_2) \rightarrow p_2(x_1)$    restriction;
$p_2(x) \rightarrow p_3(x)$    hyponymy;
$\neg(p_3(x) \wedge p_4(x))$    disjointness.

Although there are compilation techniques (e.g. Mellish 1988) which allow 
selectional constraints stated in this fashion to be implemented efficiently, 
the scheme is problematic in other respects. To start with, the assumption 
of a small set of senses for a word is at best awkward because it is difficult to 
arrive at an optimal granularity for sense distinctions. Disambiguation with 
selectional restrictions expressed as meaning postulates is also problematic 
because it is virtually impossible to devise a set of postulates that will always 
filter all but one alternative. We are thus forced to under-filter and make 
an arbitrary choice between remaining alternatives. 
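
The under-filtering problem can be illustrated with a small sketch. The sense names, the hyponymy table, and the helper functions below are hypothetical toy data chosen for illustration, not postulates from any actual system; the code only assumes that hyponymy postulates form a chain from each sense upward and that disjointness is declared between pairs of predicates.

```python
# Hyponymy postulates p2(x) -> p3(x), stored as child -> parent (toy data).
HYPONYMY = {
    "bank_institution": "organization",
    "bank_river": "location",
    "organization": "entity",
    "location": "entity",
}
# Disjointness postulates not(p3(x) and p4(x)), stored as unordered pairs.
DISJOINT = {frozenset({"organization", "location"})}


def ancestors(sense):
    """All predicates implied by `sense` through hyponymy, including itself."""
    implied = {sense}
    while sense in HYPONYMY:
        sense = HYPONYMY[sense]
        implied.add(sense)
    return implied


def consistent(sense, required):
    """False if `sense` implies a predicate disjoint from one implied by `required`."""
    return not any(frozenset({a, b}) in DISJOINT
                   for a in ancestors(sense)
                   for b in ancestors(required))


def filter_senses(senses, required):
    """Keep the senses compatible with the predicate a restriction imposes."""
    return [s for s in senses if consistent(s, required)]


senses = ["bank_institution", "bank_river"]
strict = filter_senses(senses, "organization")  # restriction is specific enough
loose = filter_senses(senses, "entity")         # restriction is too general
```

With the specific restriction a single sense survives; with the general one both survive, and the qualitative model has no basis for choosing between them.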

Logic based translation 

In both the quantitative and qualitative models we take a transfer approach 
to translation. We do not depend on interlingual symbols, but instead map 
a representation with constants associated with the source language into a 
corresponding expression with constants from the target language. For the 
qualitative model, the operable notion of correspondence is based on logical
equivalence and the constants are source word sense predicates $p_1, p_2, \ldots$
and target sense predicates $q_1, q_2, \ldots$.

More specifically, we will say the translation relation between a source
logical form $\phi_s$ and a target logical form $\phi_t$ holds if we have

$$B \cup S \cup A' \models (\phi_s \leftrightarrow \phi_t)$$

where B is a set of monolingual and bilingual meaning postulates, and S is 
a set of formulas characterizing the current context. $A'$ is a set of
assumptions that includes the assumptions $A$ which supported $\phi_s$. Here bilingual
meaning postulates are first order axioms relating source and target sense
predicates. A typical bilingual postulate for translating between $p_1$ and $q_1$
might be of the form:

$$p_5(x_1) \rightarrow (p_1(x_1, x_2) \leftrightarrow q_1(x_1, x_2)).$$

The need for the assumptions A' arises when a source language word 
is vaguer than its possible translations in the target language, so different
choices of target words will correspond to translations under different
assumptions. For example, the condition $p_5(x_1)$ above might be proved from
the input logical form, or it might need to be assumed.

In the general case, finding solutions (i.e. $A'$, $\phi_t$ pairs) for the abductive
schema is an undecidable theorem proving problem. This can be alleviated 
by placing restrictions on the form of meaning postulates and input formulas 
and using heuristic search methods. Although such an approach was applied 
with some success in a limited-domain system translating logical forms into 
database queries (Rayner and Alshawi 1992), it is likely to be impractical for 
language translation with tens of thousands of sense predicates and related 
axioms. 

Setting aside the intractability issue, this approach does not offer a prin- 
cipled way of choosing between alternative solutions proposed by the prover. 
One would like to prefer solutions with 'minimal' sets of assumptions, but 
it is difficult to find motivated definitions for this minimization in a purely 
qualitative framework. 
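
The difficulty of choosing between abductive solutions can be seen even in a drastically simplified, propositional sketch. Here each bilingual postulate is reduced to a triple (condition, source predicate, target predicate); the class, the function, and the predicate names `p5`, `p6`, `q2` used as data are illustrative assumptions, and no theorem proving is attempted.

```python
from typing import FrozenSet, List, NamedTuple, Set, Tuple


class Postulate(NamedTuple):
    condition: str  # plays the role of p5 in  p5(x1) -> (p1(x1,x2) <-> q1(x1,x2))
    source: str     # source sense predicate, e.g. p1
    target: str     # target sense predicate, e.g. q1


def translate(source: str, context: Set[str],
              postulates: List[Postulate]) -> List[Tuple[str, FrozenSet[str]]]:
    """Return (target predicate, assumptions needed) for each applicable postulate.

    A condition already present in the context is treated as proved;
    otherwise it has to be added to the assumption set.
    """
    solutions = []
    for post in postulates:
        if post.source != source:
            continue
        assumed = frozenset() if post.condition in context else frozenset({post.condition})
        solutions.append((post.target, assumed))
    return solutions


POSTULATES = [Postulate("p5", "p1", "q1"), Postulate("p6", "p1", "q2")]
no_context = translate("p1", set(), POSTULATES)
with_context = translate("p1", {"p6"}, POSTULATES)
```

When the context settles one of the conditions, the solution with the empty assumption set is the obvious choice. With no contextual support, both solutions need exactly one assumption, so counting assumptions gives no way of preferring either.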

4 Quantitative Model Components 
4.1 Moving to a Quantitative Model 

In moving to a quantitative architecture, we propose to retain many of the 
basic characteristics of the qualitative model: 

• A transfer organization with analysis, transfer, and generation com- 
ponents. 

• Monolingual models that can be used for both analysis and generation. 

• Translation models that exclusively code contrastive (cross-linguistic) 
information. 

• Hierarchical phrases capturing recursive linguistic structure. 

Instead of feature based syntax trees and first-order logical forms we will
adopt a simpler, monostratal representation that is more closely related to 
those found in dependency grammars (e.g. Hudson 1984). Dependency rep- 
resentations have been used in large scale qualitative machine translation 
systems, notably by McCord (1988). The notion of a lexical 'head' of a 
phrase is central to these representations because they concentrate on rela- 
tions between such lexical heads. In our case, the dependency representation 
is monostratal in that the relations may include ones normally classified as 
belonging to syntax, semantics or pragmatics. 

One salient property of our language model is that it is strongly lexi- 
cal: it consists of statistical parameters associated with relations between 
lexical items and the number and ordering of dependents of lexical heads. 
This lexical anchoring facilitates statistical training and sensitivity to lexi- 
cal variation and collocations. In order to gain the benefits of probabilistic 
modeling, we replace the task of developing large rule sets with the task 
of estimating large numbers of statistical parameters for the monolingual 
and translation models. This gives rise to a new cost trade-off in human 
annotation/judgement versus barely tractable fully automatic training. It 
also necessitates further research on lexical similarity and clustering (e.g. 
Pereira, Tishby and Lee 1993, Dagan, Marcus and Markovitch 1993) to 
improve parameter estimation from sparse data. 

Translation via Lexical Relation Graphs 

The model associates phrases with relation graphs. A relation graph is a 
directed labeled graph consisting of a set of relation edges. Each edge has 
the form of an atomic proposition 

$$r(w_i, w_j)$$

where $r$ is a relation symbol, $w_i$ is the lexical head of a phrase and $w_j$
is the lexical head of another phrase (typically a subphrase of the phrase
headed by $w_i$). The nodes $w_i$ and $w_j$ are word occurrences representable by
a word and an index, the indices uniquely identifying particular occurrences 
of the words in a discourse or corpus. The set of relation symbols is open 
ended, but the first argument of the relation is always interpreted as the 
head and the second as the dependent with respect to this relation. The 
relations in the models for the source and target languages need not be the 
same, or even overlap. To keep the language models simple, we will mainly 
restrict ourselves here to dependency graphs that are trees with unordered
siblings. In particular, phrases will always be contiguous strings of words 
and dependents will always be heads of subphrases. 
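
One possible encoding of relation graphs, given as a sketch rather than a prescribed representation, treats each node as a (word, index) pair and each edge as a (relation, head, dependent) triple. The relation labels `subj` and `det` and the example phrase are hypothetical; the `is_tree` check corresponds to the restriction to tree-shaped graphs adopted above.

```python
from typing import NamedTuple, Set


class Node(NamedTuple):
    word: str
    index: int  # distinguishes different occurrences of the same word


class Edge(NamedTuple):
    relation: str
    head: Node       # first argument of r(w_i, w_j)
    dependent: Node  # second argument of r(w_i, w_j)


def is_tree(edges: Set[Edge]) -> bool:
    """True if every dependent has one head and a single root reaches all nodes."""
    if not edges:
        return True  # a single-word phrase has no edges
    nodes = {e.head for e in edges} | {e.dependent for e in edges}
    dependents = [e.dependent for e in edges]
    if len(dependents) != len(set(dependents)):
        return False  # some node has more than one head
    roots = nodes - set(dependents)
    if len(roots) != 1:
        return False  # no root (a cycle) or several disconnected roots
    children = {}
    for e in edges:
        children.setdefault(e.head, []).append(e.dependent)
    seen, stack = set(), list(roots)
    while stack:  # walk down from the root
        node = stack.pop()
        seen.add(node)
        stack.extend(children.get(node, []))
    return seen == nodes


the, cat, sleeps = Node("the", 1), Node("cat", 2), Node("sleeps", 3)
graph = {Edge("subj", sleeps, cat), Edge("det", cat, the)}
cyclic = graph | {Edge("obj", cat, sleeps)}
two_heads = graph | {Edge("mod", sleeps, the)}
```

Because siblings are unordered, a set of edges is sufficient; surface order is left to the generation model.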

Ignoring algorithmic issues relating to compactly representing and effi- 
ciently searching the space of alternative hypotheses, the overall design of 
the quantitative system is as follows. The speech recognizer produces a set 
of word-position hypotheses (perhaps in the form of a word lattice) corre- 
sponding to a set of string hypotheses for the input. The source language 
model is used to compute a set of possible relation graphs, with associated 
probabilities, for each string hypothesis. A probabilistic graph translation 
model then provides, for each source relation graph, the probabilities of 
deriving corresponding graphs with word occurrences from the target lan- 
guage. These target graphs include all the words of possible translations of 
the utterance hypotheses but do not specify the surface order of these words. 
Probabilities for different possible word orderings are computed according 
to ordering parameters which form part of the target language model. 

In the following section we explain how the probabilities for these var- 
ious processing stages are combined to select the most likely target word 
sequence. This word sequence can then be handed to the speech synthe- 
sizer. For tighter integration between generation and synthesis, information 
about the derivation of the target utterance can also be passed to the syn- 
thesizer. 

4.2 Integrated Statistical Model 

The probabilities associated with phrases in the above description are com- 
puted according to the statistical models for analysis, translation, and gen- 
eration. In this section we show the relationship between these models to 
arrive at an overall statistical model of speech translation. We are not 
considering training issues in this paper, though a number of now famil- 
iar techniques ranging from methods for maximum likelihood estimation to 
direct estimation using fully annotated data are applicable. 

The objects involved in the overall model are as follows (we omit target 
speech synthesis under the assumption that it proceeds deterministically 
from a target language word string): 

• $A_s$: (acoustic evidence for) source language speech

• $W_s$: source language word string

• $W_t$: target language word string

• $C_s$: source language relation graph

• $C_t$: target language relation graph

Given a spoken input in the source language, we wish to find a target 
language string that is the most likely translation of the input. We are thus 
interested in the conditional probability of $W_t$ given $A_s$. This conditional
probability can be expressed as follows (cf. Chang and Su 1993):

$$P(W_t \mid A_s) = \sum_{W_s, C_s, C_t} P(W_s \mid A_s)\, P(C_s \mid W_s, A_s)\, P(C_t \mid C_s, W_s, A_s)\, P(W_t \mid C_t, C_s, W_s, A_s).$$

We now apply some simplifying independence assumptions concerning 
relation graphs. Specifically, that their derivation from word strings is inde- 
pendent of acoustic information; that their translation is independent of the 
original words and acoustics involved; and that target word string generation 
from target relation edges is independent of the source language
representations. The extent to which these (Markovian) assumptions hold depends on
the extent to which relation edges represent all the relevant information for 
translation. In particular it means they should express aspects of surface 
relevant to meaning, such as topicalization, as well as predicate argument 
structure. In any case, the simplifying assumptions give the following: 

$$P(W_t \mid A_s) \simeq \sum_{W_s, C_s, C_t} P(W_s \mid A_s)\, P(C_s \mid W_s)\, P(C_t \mid C_s)\, P(W_t \mid C_t).$$

This can be rewritten with two applications of Bayes' rule:

$$\sum_{W_s, C_s, C_t} P(A_s \mid W_s)\, (1/P(A_s))\, P(W_s \mid C_s)\, P(C_s)\, P(C_t \mid C_s)\, P(W_t \mid C_t).$$
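
The two applications of Bayes' rule can be spelled out as follows; this is only a worked restatement of the step above, using the same symbols. The first two lines invert the acoustic and analysis factors, and the third shows their product, in which $P(W_s)$ cancels.

```latex
\begin{align*}
P(W_s \mid A_s) &= \frac{P(A_s \mid W_s)\, P(W_s)}{P(A_s)}, \\
P(C_s \mid W_s) &= \frac{P(W_s \mid C_s)\, P(C_s)}{P(W_s)}, \\
P(W_s \mid A_s)\, P(C_s \mid W_s) &= \frac{P(A_s \mid W_s)\, P(W_s \mid C_s)\, P(C_s)}{P(A_s)}.
\end{align*}
```

The cancellation of $P(W_s)$ is why no separate word-based language model probability appears among the remaining factors.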

Since $A_s$ is given, $1/P(A_s)$ is a constant which can be ignored in finding
the maximum of $P(W_t \mid A_s)$. Determining $W_t$ that maximizes $P(W_t \mid A_s)$
therefore involves the following factors:

• $P(A_s \mid W_s)$: source language acoustics

• $P(W_s \mid C_s)$: source language generation

• $P(C_s)$: source content relations

• $P(C_t \mid C_s)$: source to target transfer

• $P(W_t \mid C_t)$: target language generation
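
How the five factors listed above combine can be sketched with a brute-force enumeration. This assumes the candidate derivations, one per combination of source string, source graph, and target graph, have already been enumerated together with their factor values; the function name, the data layout, and the numbers are invented for illustration, and no efficient search is attempted.

```python
from collections import defaultdict
from math import prod


def best_translation(derivations):
    """Pick the target string with the largest summed derivation score.

    `derivations` is an iterable of (target_string, factors) pairs, where
    `factors` holds the five values
    P(A_s|W_s), P(W_s|C_s), P(C_s), P(C_t|C_s), P(W_t|C_t)
    for one choice of W_s, C_s and C_t.
    """
    totals = defaultdict(float)
    for target, factors in derivations:
        totals[target] += prod(factors)  # sum over W_s, C_s, C_t
    return max(totals.items(), key=lambda item: item[1])


# Invented toy values; "b a" is reachable through two different derivations.
derivations = [
    ("a b", (0.5, 0.2, 0.1, 0.5, 0.5)),    # product 0.0025
    ("b a", (0.5, 0.2, 0.1, 0.5, 0.25)),   # product 0.00125
    ("b a", (0.25, 0.4, 0.1, 0.5, 0.5)),   # product 0.0025
]
best, score = best_translation(derivations)
```

In this toy case no single derivation of "b a" outscores "a b", but the sum over its derivations does, which is the distinction between maximizing $P(W_t \mid A_s)$ and picking the single best derivation.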

We assume that the speech recognizer provides acoustic scores proportional
to $P(A_s \mid W_s)$ (or logs thereof). Such scores are normally computed
by speech recognition systems, although they are usually also multiplied by
word-based language model probabilities $P(W_s)$ which we do not require in
this application context. Our approach to language modeling, which covers
the content analysis and language generation factors, is presented in
section 5, and the transfer probabilities fall under the translation model of
section 6.

Finally note that by another application of Bayes' rule we can replace the
two factors $P(C_s) P(C_t \mid C_s)$ by $P(C_t) P(C_s \mid C_t)$ without changing other parts
of the model. This latter formulation allows us to apply constraints imposed
by the target language model to filter inappropriate possibilities suggested
by analysis and transfer. In some respects this is similar to Dagan and Itai's
(1994) approach to word sense disambiguation using statistical associations
in a second language.



5 Language Models 

5.1 Language Production Model 

Our language model can be viewed in terms of a probabilistic generative 
process based on the choice of lexical 'heads' of phrases and the recursive 
generation of subphrases and their ordering. For this purpose, we can define 
the head word of a phrase to be the word that most strongly influences the 
way the phrase may be combined with other phrases. This notion has been 
central to a number of approaches to grammar for some time, including 
theories like dependency grammar (Hudson 1976, 1990) and HPSG (Pollard 
and Sag 1987). More recently, the statistical properties of associations
between words, and more particularly heads of phrases, have become an active
area of research (e.g. Chang, Luo, and Su 1992; Hindle and Rooth 1993). 

The language model factors the statistical derivation of a sentence with
word string $W$ as follows:

$$P(W) = \sum_C P(C)\, P(W \mid C)$$

where $C$ ranges over relation graphs. The content model, $P(C)$, and
generation model, $P(W \mid C)$, are components of the overall statistical model for
spoken language translation given earlier. This decomposition of $P(W)$ can
be viewed as first deciding on the content of a sentence, formulated as a
set of relation edges according to a statistical model for $P(C)$, and then
deciding on word order according to $P(W \mid C)$.

Of course, this decomposition simplifies the realities of language pro- 
duction in that real language is always generated in the context of some 
situation S (real or imaginary), so a more comprehensive model would be 
concerned with $P(C \mid S)$, i.e. language production in context. This is less
important, however, in the translation setting since we produce $C_t$ in the
context of a source relation graph $C_s$ and we assume the availability of a
model for $P(C_t \mid C_s)$.

5.2 Content Derivation Model 

The model for deriving the relation graph of a phrase is taken to consist 
of choosing a lexical head $h_0$ for the phrase (what the phrase is 'about')
followed by a series of 'node expansion' steps. An expansion step takes a 
node and chooses a possibly empty set of edges (relation labels and ending 
nodes) starting from that node. Here we consider only the case of relation 
graphs that are trees with unordered siblings. 

To start with, let us take the simplified case where a head word h has 
no optional or duplicated dependents (i.e. exactly one for each relation). 
There will be a set of edges 

$$E(h) = \{r_1(h, w_1), r_2(h, w_2), \ldots, r_k(h, w_k)\}$$

corresponding to the local tree rooted at $h$ with dependent nodes $w_1 \ldots w_k$.
The set of relation edges for the entire derivation is the union of these local 
edge sets. 

To determine the probability of deriving a relation graph $C$ for a phrase
headed by $h_0$ we make use of parameters ('dependency parameters')

$$P(r(h, w) \mid h, r)$$

for the probability, given a node $h$ and a relation $r$, that $w$ is an $r$-dependent
of $h$. Under the assumption that the dependents of a head are chosen
independently from each other, the probability of deriving $C$ is:

$$P(C) = P(Top(h_0)) \prod_{r(h, w) \in C} P(r(h, w) \mid h, r)$$

where $P(Top(h_0))$ is the probability of choosing $h_0$ to start the derivation.

If we now remove the assumption made earlier that there is exactly one
$r$-dependent of a head, we need to elaborate the derivation model to include
choosing the number of such dependents. We model this by parameters

$$P(N(r, n) \mid h)$$

that is, the probability that head $h$ has $n$ $r$-dependents. We will refer
to this probability as a 'detail parameter'. Our previous assumption amounted
to stating that this was always 1 for $n = 1$ or for $n = 0$. Detail
parameters allow us to model, for example, the number of adjectival modifiers
of a noun or the 'degree' to which a particular argument of a verb is
optional. The probability of an expansion of $h$ giving rise to local edges
$E(h)$ is now:

$$P(E(h) \mid h) = \prod_r P(N(r, n_r) \mid h) \, k(n_r) \prod_{1 \le i \le n_r} P(r(h, w_i^r) \mid h, r)$$

where $r$ ranges over the set of relation labels and $h$ has $n_r$
$r$-dependents $w_1^r \ldots w_{n_r}^r$. $k(n_r)$ is a combinatoric constant
for taking account of the fact that we are not distinguishing permutations of
the dependents (e.g. there are $n_r!$ permutations of the $r$-dependents of
$h$ if these dependents are all distinct).

So if $h_0$ is the root of a tree $C$, we have

$$P(C) = P(\mathrm{Top}(h_0)) \prod_{h \in heads(C)} P(E_C(h) \mid h)$$

where $heads(C)$ is the set of nodes in $C$ and $E_C(h)$ is the set of edges
headed by $h$ in $C$.
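
The node-by-node computation with detail parameters might be sketched as
follows. This is a minimal illustration under stated assumptions: nodes are
identified by their words (so each word occurs once in the tree), the
combinatoric constant is taken to be k(n) = n!, which the text gives only for
the case of distinct dependents, and all names and parameter values are
hypothetical.

```python
from collections import defaultdict
from math import factorial, prod

def expansion_probability(head, local_edges, detail, dep, relations):
    """P(E(h) | h) for local edges [(relation, dependent), ...] of `head`."""
    by_rel = defaultdict(list)
    for r, w in local_edges:
        by_rel[r].append(w)
    p = 1.0
    for r in relations:
        n_r = len(by_rel[r])
        p *= detail[(head, r, n_r)]          # detail parameter P(N(r, n_r) | h)
        p *= factorial(n_r)                  # k(n_r), assuming distinct dependents
        p *= prod(dep[(head, r, w)] for w in by_rel[r])
    return p

def tree_probability(root, edges, top, detail, dep, relations):
    """P(C) = P(Top(h0)) * product over all nodes h of P(E_C(h) | h)."""
    nodes = {root} | {w for (_, _, w) in edges}
    p = top[root]
    for h in nodes:
        local = [(r, w) for (h2, r, w) in edges if h2 == h]
        p *= expansion_probability(h, local, detail, dep, relations)
    return p

# Hypothetical toy parameters: a noun with two adjectival modifiers.
relations = ["mod"]
top = {"flight": 0.01}
detail = {
    ("flight", "mod", 2): 0.2,
    ("cheap", "mod", 0): 0.9,
    ("early", "mod", 0): 0.9,
}
dep = {("flight", "mod", "cheap"): 0.1, ("flight", "mod", "early"): 0.05}
edges = [("flight", "mod", "cheap"), ("flight", "mod", "early")]
p_tree = tree_probability("flight", edges, top, detail, dep, relations)
```

Leaf nodes contribute through their detail parameters for n = 0, since
heads(C) ranges over all nodes of the tree.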

The above formulation is only an approximation for relation graphs that 
are not trees because the independence assumptions which allow the depen- 
dency parameters to be simply multiplied together no longer hold for the 
general case. Dependency graphs with cycles do arise as the most natural 
analyses of certain linguistic constructions, but calculating their probabili- 
ties on a node by node basis as above may still provide probability estimates 
that are accurate enough for practical purposes. 

5.3 Generation Model 

We now return to the generation model $P(W \mid C)$. As mentioned earlier,
since $C$ includes the words in $W$ and a set of relations between them, the
generation model is concerned only with surface order. One possibility is
to use 'bi-relation' parameters for the probability that an $r_i$-dependent
immediately follows an $r_j$-dependent. This approach is problematic for our
overall statistical model because such parameters are not independent from
the 'detail' parameters specifying the number of $r$-dependents of a head.

We therefore adopt the use of 'sequencing' parameters, these being
probabilities of particular orderings of dependents given that the multiset
of dependency relations is known. We let the identity relation $e$ stand for
the head itself. Specifically, we have parameters

$$P(s \mid M(s))$$

where $s$ is a sequence of relation labels including an occurrence of $e$ and
$M(s)$ is the multiset for this sequence. For a head $h$ in a relation graph
$C$, let $s_{WCh}$ be the sequence of dependent relations induced by a
particular word string $W$ generated from $C$. We now have

$$P(W \mid C) = \prod_h \Big( \prod_r \frac{1}{k(n_r)} \Big) P(s_{WCh} \mid M(s_{WCh}))$$

where $h$ ranges over all the heads in $C$, and $n_r$ is the number of
occurrences of $r$ in $s_{WCh}$, assuming that all orderings of the
$r$-dependents are equally likely. We can thus use these sequencing
parameters directly in our overall model.

To summarize, our monolingual models are specified by:

• topmost head parameters $P(\mathrm{Top}(h))$

• dependency parameters $P(r(h, w) \mid h, r)$

• detail parameters $P(N(r, n) \mid h)$

• sequencing parameters $P(s \mid M(s))$
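
To make the role of the sequencing parameters concrete, the ordering
probability can be sketched as below. This is an illustrative sketch, not a
definitive implementation: the table seq_param, its values and the function
name are hypothetical, the table is keyed by the sequence alone because M(s)
is determined by s, and k(n) is again assumed to be n!.

```python
from collections import Counter
from math import factorial, prod

HEAD = "e"  # identity relation standing for the head itself

def ordering_probability(sequences, seq_param):
    """P(W | C) as a product over heads.

    sequences: one relation-label sequence s_WCh per head h, as induced by W.
    seq_param: dict mapping a sequence (tuple of labels) to P(s | M(s)).
    """
    p = 1.0
    for s in sequences:
        counts = Counter(label for label in s if label != HEAD)
        # Product over relations r of k(n_r), assumed here to be n_r!.
        k = prod(factorial(n) for n in counts.values())
        p *= seq_param[tuple(s)] / k
    return p

# Hypothetical parameters: a noun preceded by a determiner and two modifiers,
# and three leaf words whose only 'relation' is the head itself.
seq_param = {
    ("det", "mod", "mod", "e"): 0.8,
    ("e",): 1.0,
}
sequences = [("det", "mod", "mod", "e"), ("e",), ("e",), ("e",)]
p_order = ordering_probability(sequences, seq_param)
```

In this toy case the two modifiers can be interchanged, so the sequencing
probability 0.8 is divided by 2! to give the probability of one particular
word order.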

The overall model splits the contributions of content $P(C)$ and ordering
$P(W \mid C)$. However, we may also want a model for $P(W)$, for example for
pruning speech recognition hypotheses. Combining our content and ordering
models we get:

$$P(W) = \sum_C P(C) \, P(W \mid C) = \sum_C P(\mathrm{Top}(h_C)) \prod_{h \in W} \Big[ P(s_{WCh} \mid h) \prod_{r(h, w) \in E_C(h)} P(r(h, w) \mid h, r) \Big]$$

The parameters $P(s \mid h)$ can be derived by combining sequencing
parameters with the detail parameters for $h$.

6 Translation Model 

6.1 Mapping Relation Graphs 

As already mentioned, the translation model defines mappings between
relation graphs $C_s$ for the source language and $C_t$ for the target
language. A direct (though incomplete) justification of translation via
relation graphs may be based on a simple referential view of natural language
semantics. Thus nominals and their modifiers pick out entities in a (real or
imaginary) world, verbs and their modifiers refer to actions or events in
which the entities participate in roles indicated by the edge relations.
Under this view, the purpose of the translation mapping is to determine a
target language relation graph that provides the best approximation to the
referential function induced by the source relation graph. We call this
approximating referential equivalence.

This referential view of semantics is not adequate for taking account
of much of the complexity of natural language, including many aspects of
quantification, distributivity and modality. This means it cannot capture
some of the subtleties that a theory based on logical equivalence might be
expected to. On the other hand, when we proposed a logic-based approach
as our qualitative model, we had to restrict it to a simple first-order logic
anyway for computational reasons, and even then it did not appear to be
practical. Thus using the more impoverished lexical relations representation
may not be costing us much in practice.

One aspect of the representation that is particularly useful in the trans- 
lation application is its convenience for partial and/or incremental repre- 
sentation of content - we can refine the representation by the addition of 
further edges. A fully specified denotation of the meaning of a sentence is 
rarely required for translation, and as we pointed out when discussing logic 
representations, a complete specification may not have been intended by 
the speaker. Although we have not provided a denotational semantics for 
sets of relation edges, we anticipate that this will be possible along the lines 
developed in monotonic semantics (Alshawi and Crouch 1992). 

6.2 Translation Parameters 

To be practical, a model for $P(C_t \mid C_s)$ needs to decompose the source
and target graphs $C_s$ and $C_t$ into subgraphs small enough that subgraph
translation parameters can be estimated. We do this with the help of 'node
alignment relations' between the nodes of these graphs. These alignment
relations are similar in some respects to the alignments used by Brown et
al. (1990) in their surface translation model. The translation probability is
then the sum of probabilities over different alignments $f$:

$$P(C_t \mid C_s) = \sum_f P(C_t, f \mid C_s).$$

There are different ways to model $P(C_t, f \mid C_s)$ corresponding to
different kinds of alignment relations and different independence assumptions
about the translation mapping.

For our quantitative design, we adopt a simple model in which lexical
and relation (structural) probabilities are assumed to be independent. In
this model the alignment relations are functions from the word occurrence
nodes of $C_t$ to the word occurrences of $C_s$. The idea is that
$f(v_j) = w_i$ means that the source word occurrence $w_i$ 'gave rise' to the
target word occurrence $v_j$. The inverse relation $f^{-1}$ need not be a
function, allowing different numbers of words in the source and target
sentences.

We decompose $P(C_t, f \mid C_s)$ into 'lexical' and 'structural'
probabilities as follows:

$$P(C_t, f \mid C_s) = P(N_t, f \mid N_s) \, P(E_t \mid N_t, f, C_s)$$

where $N_t$ and $N_s$ are the node sets for $C_t$ and $C_s$ respectively, and
$E_t$ is the set of edges for the target graph.

The first factor $P(N_t, f \mid N_s)$ is the lexical component in that it
does not take into account any of the relations in the source graph $C_s$.
This lexical component is the product of alignment probabilities for each
node of $N_s$:

$$P(N_t, f \mid N_s) = \prod_{w_i \in N_s} P(f^{-1}(w_i) = \{v_i^1 \ldots v_i^n\} \mid w_i).$$

That is, the probability that $f$ maps exactly the (possibly empty) subset
$\{v_i^1 \ldots v_i^n\}$ of $N_t$ to $w_i$. These sets are assumed to be
disjoint for different source graph nodes, so we can replace the factors in
the above product with parameters:

$$P(M \mid w)$$

where $w$ is a source language word and $M$ is a multiset of target language
words.
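
A small sketch may help to show how the lexical component is assembled from
an alignment function. The node identifiers, the table lex_param and its
values are hypothetical; multisets of target words are represented as sorted
tuples purely for convenience.

```python
from collections import defaultdict

def lexical_probability(source_nodes, target_nodes, f, lex_param):
    """P(N_t, f | N_s) as a product of parameters P(M | w).

    source_nodes, target_nodes: dict mapping node id -> word.
    f: dict mapping each target node id to a source node id.
    lex_param: dict mapping (source word, sorted tuple of target words)
               to P(M | w); unseen pairs are given probability 0.
    """
    inverse = defaultdict(list)          # f^{-1}, which need not be a function
    for v, w_id in f.items():
        inverse[w_id].append(target_nodes[v])
    p = 1.0
    for w_id, word in source_nodes.items():
        m = tuple(sorted(inverse[w_id])) # possibly empty multiset
        p *= lex_param.get((word, m), 0.0)
    return p

# Hypothetical example: three source words giving rise to three target words,
# with one source word mapped to nothing and one mapped to two words.
source_nodes = {"s1": "does", "s2": "not", "s3": "sleep"}
target_nodes = {"t1": "ne", "t2": "dort", "t3": "pas"}
f = {"t1": "s2", "t2": "s3", "t3": "s2"}
lex_param = {
    ("does", ()): 0.3,
    ("not", ("ne", "pas")): 0.6,
    ("sleep", ("dort",)): 0.5,
}
p_lex = lexical_probability(source_nodes, target_nodes, f, lex_param)
```

Because every source node contributes one factor, a source word with no
aligned target words still contributes its parameter for the empty multiset.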

We will derive a target set of edges $E_t$ of $C_t$ by $k$ derivation steps
which partition the set of source edges $E_s$ into subgraphs
$S_1 \ldots S_k$. These subgraphs give rise to disjoint sets of relation
edges $T_1 \ldots T_k$ which together form $E_t$. The structural component of
our translation model will be the sum of derivation probabilities for such an
edge set $E_t$.

For simplicity, we assume here that the source graph $C_s$ is a tree. This
is consistent with our earlier assumptions about the source language model.
We take our partitions of the source graph to be the edge sets for local
trees. This ensures that the partitioning is deterministic, so the
probability of a derivation is the product of the probabilities of derivation
steps. More complex models with larger partitions rooted at a node are
possible but these require additional parameters for partitioning. For the
simple model it remains to specify derivation step probabilities.

The probability of a derivation step is given by parameters of the form:

$$P(T_i' \mid S_i', f_i)$$

where $S_i'$ and $T_i'$ are unlabeled graphs and $f_i$ is a node alignment
function from $T_i'$ to $S_i'$. Unlabeled graphs are just like our relation
edge graphs except that the nodes are not labeled with words (the edges still
have relation labels). To apply a derivation step we need a notion of graph
matching that respects edge labels: $g$ is an isomorphism (modulo node
labels) from a graph $G$ to a graph $H$ if $g$ is a one-one and onto function
from the nodes of $G$ to the nodes of $H$ such that

$$r(a, b) \in G \text{ iff } r(g(a), g(b)) \in H.$$
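
The matching condition can be checked directly by brute force for the small
local trees considered here. The following sketch is illustrative only: the
function name and the representation of edges as (relation, node, node)
triples are assumptions, and trying every bijection is practical only because
the graphs involved have very few nodes.

```python
from itertools import permutations

def find_isomorphism(g_nodes, g_edges, h_nodes, h_edges):
    """Return a dict g with r(a, b) in G iff r(g(a), g(b)) in H, else None.

    Edges are (relation, from_node, to_node) triples; node labels (words)
    play no part in the match.
    """
    g_nodes, h_nodes = list(g_nodes), list(h_nodes)
    g_edges, h_edges = set(g_edges), set(h_edges)
    if len(g_nodes) != len(h_nodes) or len(g_edges) != len(h_edges):
        return None
    for image in permutations(h_nodes):
        g = dict(zip(g_nodes, image))    # candidate one-one, onto mapping
        if {(r, g[a], g[b]) for (r, a, b) in g_edges} == h_edges:
            return g
    return None

# A local tree with a subject and an object edge, matched against a graph
# with the same shape but different node names.
G_nodes = ["x", "y", "z"]
G_edges = [("subj", "x", "y"), ("obj", "x", "z")]
H_nodes = [1, 2, 3]
H_edges = [("obj", 1, 2), ("subj", 1, 3)]
iso = find_isomorphism(G_nodes, G_edges, H_nodes, H_edges)
```

For this example the only mapping that respects the edge labels sends x to 1,
y to 3 and z to 2.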

The derivation step with parameter $P(T_i' \mid S_i', f_i)$ is applicable to
the source edges $S_i$, under the alignment $f$, giving rise to the target
edges $T_i$ if (i) there is an isomorphism $h_i$ from $S_i'$ to $S_i$,
(ii) there is an isomorphism $g_i$ from $T_i$ to $T_i'$, and (iii) for any
node $v$ of $T_i$ it must be the case that

$$h_i(f_i(g_i(v))) = f(v).$$

This last condition ensures that the target graph partitions join up in a way
that is compatible with the node alignment $f$.

The factoring of the translation model into these lexical and structural
components means that it will overgenerate because these aspects are not
independent in translation between real natural languages. It is therefore
appropriate to filter translation hypotheses by rescoring according to the
version of the overall statistical model that included the factors
$P(C_t) \, P(C_s \mid C_t)$ so that the target language model constrains the
output of the translation model. Of course, in this case we need to model the
translation relation in the 'reverse' direction. This can be done in a
parallel fashion to the forward direction described above.

7 Conclusions 

Our qualitative and quantitative models have a similar overall structure
and there are clear parallels between the factoring of logical constraints
and statistical parameters, for example monolingual postulates and dependency
parameters, bilingual postulates and translation parameters. The parallelism
would have been closer if we had adopted ID/LP style rules (Gazdar et al.
1985) in the qualitative model. However, we argued in section 3 that our
qualitative model suffered from lack of robustness, from having only the
crudest means for choosing between competing hypotheses, and from being
computationally intractable for large vocabularies.

The quantitative model is in a much better position to cope with these 
problems. It is less brittle because statistical associations have replaced 
constraints (featural, selectional, etc.) that must be satisfied exactly. The 
probabilistic models give us a systematic and well motivated way of rank- 
ing alternative hypotheses. Computationally, the quantitative model lets 
us escape from the undecidability of logic-based reasoning. Because this 
model is highly lexical, we can hope that the input words will allow effective 
pruning by limiting the number of search paths having significantly high 
probabilities. 

We retained some of the basic assumptions about the structure of lan- 
guage when moving to the quantitative model. In particular, we preserved 
the notion of hierarchical phrase structure. Relations motivated by depen- 
dency grammar made it possible to do this without giving up sensitivity to 
lexical collocations which underpin simple statistical models like N-grams. 
The quantitative model also reduced overall complexity in terms of the sets 
of symbols used. In addition to words, it only required symbols for de- 
pendency relations, whereas the qualitative model required symbol sets for 
linguistic categories and features, and a set of word sense symbols. Despite 
their apparent importance to translation, the quantitative system can avoid 
the use of word sense symbols (and the problems of granularity they give 
rise to) by exploiting statistical associations between words in the target 
language to filter implicit sense choices. 

Finally, here is a summary of our reasons for combining statistical meth- 
ods with dependency representations in our language and translation mod- 
els: 

• inherent lexical sensitivity of dependency representations, facilitating 
parameter estimation; 

• quantitative preference based on probabilistic derivation and transla- 
tion; 

• incremental and/or partial specification of the content of utterances, 
particularly useful in translation; 

• decomposition of complex utterances through recursive linguistic
structure.

These factors suggest that dependency grammar will play an increasingly
important role as language processing systems seek to combine both
structural and collocational information.